﻿<p>
  In the last chapter we used the S&amp;P 500 index to predict Amazon stock returns. Now we will add more variables to improve our model's predictions. In particular, we shall consider Amazon's competitors.
</p>

<div class="section-example-container">

<pre class="python">import numpy as np
import pandas as pd
import quandl
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm

# Get stock prices
quandl.ApiConfig.api_key = 'tAyfv1zpWnyhmDsp91yv'
spy_table  = quandl.get('BCIW/_SPXT')
amzn_table = quandl.get('WIKI/AMZN')
ebay_table = quandl.get('WIKI/EBAY')
wal_table  = quandl.get('WIKI/WMT')
aapl_table = quandl.get('WIKI/AAPL')</pre>
</div>

<p>
  Then we fetch closing prices starting from 2016:
</p>

<div class="section-example-container">

<pre class="python">
spy  = spy_table .loc['2016',['Close']]
amzn = amzn_table.loc['2016',['Close']]
ebay = ebay_table.loc['2016',['Close']]
wal  = wal_table .loc['2016',['Close']]
aapl = aapl_table.loc['2016',['Close']]
</pre>
</div>

<p>
  After taking log returns of each stock, we concatenate them into a DataFrame, and print out the last 5 rows:
</p>

<div class="section-example-container">

<pre class="python">
spy_log  = np.log(spy.Close) .diff().dropna()
amzn_log = np.log(amzn.Close).diff().dropna()
ebay_log = np.log(ebay.Close).diff().dropna()
wal_log  = np.log(wal.Close) .diff().dropna()
aapl_log = np.log(aapl.Close).diff().dropna()
df = pd.concat([spy_log,amzn_log,ebay_log,wal_log,aapl_log],axis = 1).dropna()
df.columns = ['SPY', 'AMZN', 'EBAY', 'WAL', 'AAPL']
df.tail()
</pre>
</div>

<table cellspacing="0" cellpadding="3" class="table qc-table">
    <tr>
        <td>Date</td>
        <td align="right">SPY</td>
        <td align="right">AMZN</td>
        <td align="right">EBAY</td>
        <td align="right">WAL</td>
        <td align="right">AAPL</td>
    </tr>
    <tr>
        <td>2016-12-23</td>
        <td align="right">0.001351</td>
        <td align="right">-0.007531</td>
        <td align="right">0.008427</td>
        <td align="right">-0.000719</td>
        <td align="right">0.001976</td>
    </tr>
    <tr>
        <td>2016-12-27</td>
        <td align="right">0.002254</td>
        <td align="right">0.014113</td>
        <td align="right">0.014993</td>
        <td align="right">0.002298</td>
        <td align="right">0.006331</td>
    </tr>
    <tr>
        <td>2016-12-28</td>
        <td align="right">-0.008218</td>
        <td align="right">0.000946</td>
        <td align="right">-0.007635</td>
        <td align="right">-0.005611</td>
        <td align="right">-0.004273</td>
    </tr>
    <tr>
        <td>2016-12-29</td>
        <td align="right">-0.000247</td>
        <td align="right">-0.009081</td>
        <td align="right">-0.001000</td>
        <td align="right">-0.000722</td>
        <td align="right">-0.000257</td>
    </tr>
    <tr>
        <td>2016-12-30</td>
        <td align="right">-0.004601</td>
        <td align="right">-0.020172</td>
        <td align="right">-0.009720</td>
        <td align="right">-0.002023</td>
        <td align="right">-0.007826</td>
    </tr>
</table>

<p>
  As before, we use the 'statsmodels' package to perform simple linear regression:
</p>

<div class="section-example-container">

<pre class="python">simple = sm.ols(formula = 'amzn ~ spy', data = df).fit()
print simple.summary()
</pre>
</div>

<table cellspacing="0" cellpadding="3" class="table qc-table">
    <tr>
        <td>&nbsp;</td>
        <td align="right">coef</td>
        <td align="right">std err</td>
        <td align="right">t</td>
        <td align="right">P>|t|</td>
        <td align="right">[0.025</td>
        <td align="right">0.975]</td>
    </tr>
    <tr>
        <td>Intercept</td>
        <td align="right">9.876e-05</td>
        <td align="right">0.001</td>
        <td align="right">0.097</td>
        <td align="right">0.923</td>
        <td align="right">-0.002</td>
        <td align="right">0.002</td>
    </tr>
    <tr>
        <td>spy</td>
        <td align="right">1.0796</td>
        <td align="right">0.124</td>
        <td align="right">8.725</td>
        <td align="right">0.000</td>
        <td align="right">0.836</td>
        <td align="right">1.323</td>
    </tr>
</table>

<p>
  Similarly, we can build a multiple linear regression model:
</p>

<div class="section-example-container">

<pre class="python">model = sm.ols(formula = 'amzn ~ spy + ebay + wal', data = df).fit()
print model.summary()
</pre>
</div>

<table cellspacing="0" cellpadding="3" class="table qc-table">
    <tr>
        <td>&nbsp;</td>
        <td align="right">coef</td>
        <td align="right">std err</td>
        <td align="right">t</td>
        <td align="right">P>|t|</td>
        <td align="right">[0.025</td>
        <td align="right">0.975]</td>
    </tr>
    <tr>
        <td>Intercept</td>
        <td align="right">0.0001</td>
        <td align="right">0.001</td>
        <td align="right">0.134</td>
        <td align="right">0.894</td>
        <td align="right">-0.002</td>
        <td align="right">0.002</td>
    </tr>
    <tr>
        <td>spy</td>
        <td align="right">1.0468</td>
        <td align="right">0.170</td>
        <td align="right">6.155</td>
        <td align="right">0.000</td>
        <td align="right">0.712</td>
        <td align="right">1.382</td>
    </tr>
    <tr>
        <td>ebay</td>
        <td align="right">-0.0795</td>
        <td align="right">0.058</td>
        <td align="right">-1.364</td>
        <td align="right">0.174</td>
        <td align="right">-0.194</td>
        <td align="right">0.035</td>
    </tr>
    <tr>
        <td>wal</td>
        <td align="right">-0.0865</td>
        <td align="right">0.089</td>
        <td align="right">-0.976</td>
        <td align="right">0.330</td>
        <td align="right">-0.261</td>
        <td align="right">0.088</td>
    </tr>
    <tr>
        <td>aapl</td>
        <td align="right">0.1529</td>
        <td align="right">0.084</td>
        <td align="right">1.831</td>
        <td align="right">0.068</td>
        <td align="right">-0.012</td>
        <td align="right">0.317</td>
    </tr>
</table>

<p>
  As seen from the summary table, the p-values for Ebay, Walmart and Apple are 0.174, 0.330 and 0.068 respectively, so none of them are significant at a 95% confidence level. The multiple regression model has a higher \( R^2 \) than the simple one: 0.254 vs 0.234. Indeed, \( R^2 \) cannot decrease as the number of variables increases. Why? If an extra variable is added to our regression model, but it cannot account for variations in the response (amzn), then its estimated coefficient will simply be zero. It's as though that variable was never included in the model, so \( R^2 \) will not change. However, it is not always better to add hundreds of variables or we will overfit our model. We'll talk about this in a later chapter.
</p>

<p>
  Can we improve our model further? Here we try the <strong>Fama-French 5-factor model</strong>, which is an important model in asset pricing theory. We will cover it in the later tutorials.
</p>

<p>
  The data needed are publicly available on French's website. We have saved a copy for convenience. The following code fetches the data.
</p>

<div class="section-example-container">

<pre class="python">
import urllib2
from datetime import datetime

url = 'https://www.quantconnect.com/tutorials/wp-content/uploads/2017/08/F-F_Research_Data_5_Factors_2x3_daily.csv'
response   = urllib2.urlopen(url)
fama_table = pd.read_csv(response)

# Convert time column into index
fama_table.index = [datetime.strptime(str(x), "%Y%m%d")
                    for x in fama_table.iloc[:,0]]
# Remove time column
fama_table = fama_table.iloc[:,1:]
</pre>
</div>

<p>
  With the data, we can construct a Fama-French factor model:
</p>
<div class="section-example-container">

<pre class="python">
fama = fama_table['2016']
fama = fama.rename(columns = {'Mkt-RF':'MKT'})
fama = fama.apply(lambda x: x/100)
fama_df = pd.concat([fama, amzn_log], axis = 1)
fama_model = sm.ols(formula = 'Close~MKT+SMB+HML+RMW+CMA', data = fama_df).fit()
print fama_model.summary()
</pre>
</div>
<img class="img-responsive" src="https://cdn.quantconnect.com/tutorials/i/Tutorial10-fama.png" alt="fama" />
<p>
  The Fama-French 5-factor model has a higher \( R^2 \) of 0.387. We can compare the predictions from simple linear regression and Fama-French multiple regression by plotting them together on one chart:
</p>
<div class="section-example-container">

<pre class="python">
result = pd.DataFrame({'simple regression': simple.predict(),
                       'fama_french': fama_model.predict(),
                       'sample': df.amzn}, index = df.index)

# Feel free to adjust the chart size
plt.figure(figsize = (15,7.5))
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','simple regression'])
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','fama_french'])
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','sample'])
plt.legend()
plt.show()
</pre>
</div>
<img class="img-responsive" src="https://cdn.quantconnect.com/tutorials/i/Tutorial10-compare.png" alt="compare" />
<p>
  Although it's hard to see from the chart above, the predicted return from multiple regression is closer to the actual return. Usually we don't plot the predictions to determine which model is better; we read the summary table.
</p>
